This tutorial is a work in progress, but we are excited that you are using this resource to get started with genome-wide association studies (GWAS).
This tutorial was inspired by similar work in Reed2015 and Marees2017. Unfortunately, the software used in the 2015 tutorial is now somewhat out of date: several R packages have changed or are no longer available, so an update was needed.
This tutorial is broken down into four sections: 1) Data formats, summary statistics, and quality control, 2) Imputation and population structure, 3) SNP testing, and 4) Post-analysis and biological relevance.
The first section covers the basics, from the different file types to the initial things to look for in the data and how to exclude parts of the data that would muddy our end results.
The second section covers some important considerations that should be taken into account before completing the analysis but that are not necessarily covered under quality control. Imputation is a method by which missing data are replaced with values inferred in a principled way. Accounting for population structure lets us control for possible confounders in our dataset, such as familial relations, that would otherwise incorrectly inflate results.
The third section, SNP testing, gets down to the results we are looking for: loci associated with our phenotypes of interest. This is the easy part if we complete the prior two sections correctly.
Finally, we attempt to take the results and turn them into something meaningful. For instance, I could give you the name of some SNP, but it will have no meaning to you unless you have spent time studying it. On the other hand, if I tell you that a variant of a SNP is associated with improper functioning of chloride channels, that will carry a great deal more significance.